Figure 1. We present TripoSR, a 3D reconstruction model that reconstructs high-quality 3D from single images in under 0.5 seconds. Our 
model achieves state-of-the-art performance and generalizes to objects of various types and input images across different domains. 


Abstract 


This technical report introduces TripoSR, a 3D recon- 
struction model leveraging transformer architecture for fast 
feed-forward 3D generation, producing a 3D mesh from a 
single image in under 0.5 seconds. Building upon the 
LRM [11] network architecture, TripoSR integrates sub- 
stantial improvements in data processing, model design, 
and training techniques. Evaluations on public datasets 
show that TripoSR exhibits superior performance, both 
quantitatively and qualitatively, compared to other open-
source alternatives. Released under the MIT license,
TripoSR is intended to empower researchers, developers, and
creatives with the latest advancements in 3D generative AI. 


*Equal advising.


Model: https://huggingface.co/stabilityai/TripoSR

Code: https://github.com/VAST-AI-Research/TripoSR

Demo: https://huggingface.co/spaces/stabilityai/TripoSR


1. Introduction 


The landscape of 3D Generative AI has witnessed a con- 
fluence of developments in recent years, blurring the lines 
between 3D reconstruction from single or few views and 
3D generation [3, 9, 11, 13, 17, 29, 33-35]. This conver- 
gence has been significantly accelerated by the introduc- 
tion of large-scale public 3D datasets [4, 5] and advances 
in generative model architectures. Comprehensive reviews
of these technologies can be found in the literature, such as
[15] and [22].

To overcome the scarcity of 3D training data, recent ef- 
forts have explored utilizing 2D diffusion models to cre- 
ate 3D assets from text prompts [20, 21, 27] or input im- 
ages [17, 23]. DreamFusion [20], a notable example, intro- 
duced score distillation sampling (SDS), employing a 2D 
diffusion model to guide the optimization of 3D models. 
This approach represents a pivotal strategy in leveraging 2D 
priors for 3D generation, achieving breakthroughs in gener- 
ating detailed 3D objects. However, these methods typically 
face limitations with slow generation speed, due to the ex- 
tensive optimization and computational demands, and the 
challenge of precisely controlling the output models. 

In contrast, feed-forward 3D reconstruction models
achieve significantly higher computational efficiency [7, 8, 
11-14, 17, 19, 24-26, 28, 31, 32, 35]. Several recent ap- 
proaches [11, 13, 14, 17, 24, 26, 28, 31, 35] along this di- 
rection have shown promise in scalable training on diverse 
3D datasets. These approaches facilitate rapid 3D model 
generation through fast feed-forward inference and are po- 
tentially more capable of providing precise control over the 
generated outputs, marking a notable shift in the efficiency 
and applicability of these models. 

In this work, we introduce the TripoSR model for fast feed-
forward 3D generation from a single image, which takes less
than 0.5 seconds on an A100 GPU. Building upon the
LRM [11] architecture, we introduce several improvements
in terms of data curation and rendering, model design and
training techniques. Experimental results demonstrate su-
perior performance, both quantitatively and qualitatively,
compared to other open-source alternatives. Figure 1 shows
some sample results of TripoSR. TripoSR is made available
under the MIT license, accompanied by source code,
the pretrained model, and an interactive online demo. The 
release aims to enable researchers, developers, and cre- 
atives to advance their work with the latest advancements in 
3D generative AI, promoting progress within the wider do- 
mains of AI, computer vision, and computer graphics. Next, 
we introduce the technical advances in our TripoSR model, 
followed by the quantitative and qualitative results on two 
public datasets. 


2. TripoSR: Data and Model Improvements 


The design of TripoSR is based on the LRM [11], with a series
of technical advancements in data curation, model design, and
training strategy. We now give an overview of the model
followed by our technical improvements. 


2.1. Model Overview 


Similar to LRM [11], TripoSR leverages the transformer ar- 
chitecture and is specifically designed for single-image 3D 
reconstruction. It takes a single RGB image as input and
outputs a 3D representation of the object in the image. The
core of TripoSR includes three components: an image encoder,
an image-to-triplane decoder, and a triplane-based neural
radiance field (NeRF).

The image encoder is initialized with a pre-trained vision 
transformer model, DINOv1 [1], which projects an RGB 
image into a set of latent vectors. These vectors encode 
the global and local features of the image and include the 
necessary information to reconstruct the 3D object. 

The subsequent image-to-triplane decoder maps
the latent vectors onto the triplane-NeRF representation [2].
The triplane-NeRF representation is a compact and expres- 
sive 3D representation, well-suited for representing objects 
with complex shapes and textures. Our decoder consists 
of a stack of transformer layers, each with a self-attention 
layer and a cross-attention layer. The self-attention layer al- 
lows the decoder to attend to different parts of the triplane 
representation and learn relationships between them. The 
cross-attention layer allows the decoder to attend to the la- 
tent vectors from the image encoder and incorporate global 
and local image features into the triplane representation. Fi- 
nally, the NeRF model consists of a stack of multilayer per- 
ceptrons (MLPs), which are responsible for predicting the 
color and density of a 3D point in space. 
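
To make the triplane lookup concrete, the following is a minimal sketch of how a 3D point could be turned into a feature vector before it reaches the NeRF MLP. It is not taken from the TripoSR code: the function names, the plane layout (`planes["xy"][row][col]` holding a list of channels), and the choice to concatenate the three plane features (summation is another common option) are assumptions made for illustration. The default radius of 0.87 follows Table 1.

```python
import math


def bilinear(plane, u, v):
    """Bilinearly interpolate a square feature plane at continuous (u, v).

    plane[row][col] is a list of C channel values; u indexes columns and
    v indexes rows, both expressed in pixel units in [0, size - 1].
    """
    size = len(plane)
    u = min(max(u, 0.0), size - 1.0)
    v = min(max(v, 0.0), size - 1.0)
    c0, r0 = int(math.floor(u)), int(math.floor(v))
    c1, r1 = min(c0 + 1, size - 1), min(r0 + 1, size - 1)
    fu, fv = u - c0, v - r0
    channels = len(plane[0][0])
    return [
        (1 - fv) * ((1 - fu) * plane[r0][c0][k] + fu * plane[r0][c1][k])
        + fv * ((1 - fu) * plane[r1][c0][k] + fu * plane[r1][c1][k])
        for k in range(channels)
    ]


def query_triplane(planes, point, radius=0.87):
    """Return the concatenated XY, XZ and YZ features for one 3D point."""
    size = len(planes["xy"])
    # Map each coordinate from [-radius, radius] to [0, size - 1].
    x, y, z = ((c / radius + 1.0) * 0.5 * (size - 1) for c in point)
    # Concatenation is an assumption of this sketch, not a confirmed detail.
    return (bilinear(planes["xy"], x, y)
            + bilinear(planes["xz"], x, z)
            + bilinear(planes["yz"], y, z))


# Toy example: three identical 2 x 2 planes with one channel each.
toy_plane = [[[0.0], [1.0]], [[2.0], [3.0]]]
toy_planes = {"xy": toy_plane, "xz": toy_plane, "yz": toy_plane}
center_feature = query_triplane(toy_planes, (0.0, 0.0, 0.0))
```

In a full model the returned vector would be fed to the MLP stack that predicts color and density; that part is omitted here.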

Instead of conditioning the image-to-triplane projection 
on camera parameters, we have opted to allow the model to 
“guess” the camera parameters (both extrinsics and intrin- 
sics) during training and inference. This is to enhance the 
model’s robustness to in-the-wild input images at inference 
time. By forgoing explicit camera parameter conditioning,
our approach aims to cultivate a more adaptable and resilient
model capable of handling a wide range of real-world
scenarios without the need for precise camera information. 

The architecture’s main parameters, such as the number 
of layers in the transformer, the dimensions of the triplanes, 
the specifics of the NeRF model, and the main training con- 
figurations, are detailed in Table 1. Compared to LRM [11], 
TripoSR introduces several technical improvements which 
we discuss next. 


2.2. Data Improvements 


Recognizing the critical importance of data, we have incor- 
porated two improvements in our training data collection: 


Data Curation: By selecting a carefully curated subset 
of the Objaverse [4] dataset, which is available under the 
CC-BY license, we have enhanced the quality of training 
data. 

Data Rendering: We have adopted a diverse array of 
data rendering techniques that more closely emulate the 
distribution of real-world images, thereby enhancing the 
model’s ability to generalize, even when trained exclu- 
sively with the Objaverse dataset. 


Component            Parameter             Value
Image Tokenizer      image resolution      512 x 512
                     patch size            16
                     attention layers      12
                     feature channels      768
Triplane Tokenizer   tokens                32 x 32 x 3
                     channels              16
Backbone             channels              1024
                     attention layers      16
                     attention heads       16
                     attention head dim    64
                     cross attention dim   768
Triplane Upsampler   factor                2
                     input channels        1024
                     output channels       40
                     output shape          64 x 64 x 40
NeRF MLP             width                 64
                     layers                10
                     activation            SiLU
Renderer             samples per ray       128
                     radius                0.87
                     density activation    exp
                     density bias          -1.0
Training             learning rate         4e-4
                     optimizer             AdamW
                     lr scheduler          CosineAnnealingLR
                     warm-up steps         2,000
                     λ_LPIPS               2.0
                     λ_mask                0.05

Table 1. Model configuration of TripoSR. 
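
As a quick sanity check on how the entries of Table 1 relate to each other, the sketch below derives a few quantities from them. The class and field names are hypothetical, and the 32 x 32 token grid per plane is our reading of the triplane tokenizer row; the image token count covers patch tokens only and ignores any extra class token.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TripoSRConfig:
    """Hypothetical container for a subset of the Table 1 values."""
    image_resolution: int = 512
    patch_size: int = 16
    triplane_token_res: int = 32      # assumed: 32 x 32 tokens per plane
    upsample_factor: int = 2
    triplane_out_channels: int = 40

    @property
    def image_tokens(self) -> int:
        # Number of non-overlapping patches in the input image.
        side = self.image_resolution // self.patch_size
        return side * side

    @property
    def triplane_tokens(self) -> int:
        # Three planes, each a square grid of tokens.
        return 3 * self.triplane_token_res ** 2

    @property
    def triplane_res(self) -> int:
        # Spatial resolution of each plane after upsampling.
        return self.triplane_token_res * self.upsample_factor

    @property
    def triplane_values(self) -> int:
        # Total number of scalars stored in the upsampled triplane.
        return 3 * self.triplane_res ** 2 * self.triplane_out_channels


cfg = TripoSRConfig()
print(cfg.image_tokens, cfg.triplane_tokens, cfg.triplane_res, cfg.triplane_values)
# prints: 1024 3072 64 491520
```

Under these assumptions the upsampled plane resolution of 64 agrees with the 64 x 64 x 40 output shape listed for the triplane upsampler.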


2.3. Model and Training Improvements 


Our adjustments aim to boost both the model’s efficiency 
and its performance. 


Triplane Channel Optimization. The configuration of 
channels within the triplane-NeRF representation plays an 
important role in managing the GPU memory footprint dur- 
ing both training and inference, due to the high computa- 
tional cost of volume rendering. Moreover, the channel 
count significantly influences the model’s capacity for de- 
tailed and high-fidelity reconstruction. In pursuit of an op- 
timal balance between reconstruction quality and computa- 
tional efficiency, experimental evaluations led us to adopt a 
configuration of 40 channels. This choice enables the use of 
larger batch sizes and higher resolutions during the training 
phase, while concurrently minimizing the memory require- 
ments during inference. 
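
The volume rendering cost mentioned above comes from compositing many samples along every ray. The following is a minimal sketch of standard front-to-back alpha compositing for a single ray, using the exp density activation and the density bias of -1.0 from Table 1. Applying the bias inside the activation, the uniform step size, and the function name are assumptions of this sketch rather than confirmed implementation details.

```python
import math


def composite_ray(raw_densities, colors, step, density_bias=-1.0):
    """Alpha-composite samples along one ray, front to back.

    raw_densities: raw MLP outputs per sample, before activation.
    colors: one (r, g, b) tuple per sample.
    step: distance between consecutive samples (assumed uniform).
    Returns (rgb, opacity); the opacity can serve as a rendered mask value.
    """
    rgb = [0.0, 0.0, 0.0]
    transmittance = 1.0
    for raw, color in zip(raw_densities, colors):
        sigma = math.exp(raw + density_bias)   # exp activation with bias (assumed order)
        alpha = 1.0 - math.exp(-sigma * step)  # opacity of this ray segment
        weight = transmittance * alpha
        for k in range(3):
            rgb[k] += weight * color[k]
        transmittance *= 1.0 - alpha           # light remaining after this sample
    return rgb, 1.0 - transmittance
```

With 128 samples per ray, rendering one full 512 x 512 view requires 512 x 512 x 128 = 33,554,432 point queries, each of which reads features from the triplane, so the per-point feature size directly affects memory use.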


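Mask Loss. We incorporated a mask loss function during
training that significantly reduces “floater” artifacts and
improves the fidelity of reconstructions:

$$\mathcal{L}_{\text{mask}}(\hat{M}_v, M_v^{GT}) = \mathrm{BCE}(\hat{M}_v, M_v^{GT}), \tag{1}$$

where $\hat{M}_v$ and $M_v^{GT}$ are rendered and ground-truth mask
images of the $v$-th supervision view, respectively. The full
training loss we minimized during training is:

$$\mathcal{L}_{\text{recon}}(I) = \frac{1}{V} \sum_{v=1}^{V} \Big( \mathcal{L}_{\text{MSE}}(\hat{I}_v, I_v^{GT}) + \lambda_{\text{LPIPS}} \mathcal{L}_{\text{LPIPS}}(\hat{I}_v, I_v^{GT}) + \lambda_{\text{mask}} \mathcal{L}_{\text{mask}}(\hat{M}_v, M_v^{GT}) \Big). \tag{2}$$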
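
Equations (1) and (2) can be sketched in a few lines of plain Python. The weights default to the values in Table 1 (2.0 for LPIPS, 0.05 for the mask term). LPIPS itself requires a pretrained perceptual network, so this sketch assumes its per-view value has already been computed and is passed in; the dictionary keys and function names are hypothetical.

```python
import math


def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between two flat lists of values in [0, 1]."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)  # clamp so that log() stays finite
        total -= t * math.log(p) + (1.0 - t) * math.log(1.0 - p)
    return total / len(pred)


def mse(pred, target):
    """Mean squared error between two flat lists of pixel values."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def recon_loss(views, lambda_lpips=2.0, lambda_mask=0.05):
    """Average the per-view loss of Equation (2) over V supervision views.

    Each view is a dict with flat lists 'rgb', 'rgb_gt', 'mask', 'mask_gt'
    and a precomputed scalar 'lpips' (an assumption of this sketch).
    """
    total = 0.0
    for view in views:
        total += (mse(view["rgb"], view["rgb_gt"])
                  + lambda_lpips * view["lpips"]
                  + lambda_mask * bce(view["mask"], view["mask_gt"]))
    return total / len(views)
```

The mask term penalizes rendered opacity in regions the ground-truth mask marks as empty, which is the mechanism by which it discourages floaters.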


Figure 2. We outperform SOTA methods for 3D reconstruction
while achieving fast inference time. In the figure, F-Score with
threshold 0.1 is averaged over GSO [6] and OmniObject3D [30].

Local Rendering Supervision. Our model relies entirely on
rendering losses for supervision, so it needs high-resolution
rendering to learn detailed shape and texture reconstructions.
However, rendering and supervising at high resolutions
(e.g., 512 x 512 or higher) can exceed the available compute
and GPU memory. To circumvent this issue, we render
128 x 128-sized random patches from the original 512 x 512
resolution images during training. Crucially, we increase the
likelihood of selecting crops that cover foreground regions,
thereby placing greater emphasis on the areas of interest.
This importance sampling strategy ensures faithful
reconstructions of object surface details, effectively balancing
computational efficiency and reconstruction granularity.
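
One simple way to bias patch selection towards the foreground is sketched below. The report does not specify the exact sampling scheme, so this is an illustrative assumption: with probability `p_foreground` (a hypothetical parameter) the crop is centred on a randomly chosen foreground pixel, and otherwise it is placed uniformly at random.

```python
import random


def sample_patch_origin(mask, patch=128, p_foreground=0.8, rng=random):
    """Pick the top-left corner (row, col) of a patch x patch crop.

    mask: 2D list of 0/1 values marking foreground pixels.
    p_foreground: hypothetical probability of centring the crop on a
    foreground pixel; otherwise the crop is placed uniformly at random.
    """
    height, width = len(mask), len(mask[0])
    max_row, max_col = height - patch, width - patch
    foreground = [(r, c) for r, row in enumerate(mask)
                  for c, value in enumerate(row) if value]
    if foreground and rng.random() < p_foreground:
        r, c = rng.choice(foreground)
        # Centre the crop on the chosen pixel, then clamp it inside the image.
        row = min(max(r - patch // 2, 0), max_row)
        col = min(max(c - patch // 2, 0), max_col)
        return row, col
    return rng.randint(0, max_row), rng.randint(0, max_col)
```

Only the rays inside the selected crop are rendered and supervised, so a 128 x 128 patch costs one sixteenth of the rays of a full 512 x 512 view.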


3. Results 


We quantitatively and qualitatively compare TripoSR 
to previous state-of-the-art methods using two different 
datasets with 3D reconstruction metrics. 

Evaluation Datasets. We curate two public datasets, 
GSO [6] and OmniObject3D [30], for evaluations. We iden- 
tify that both datasets include many simple-shaped objects 
(e.g., box, sphere or cylinder) and can thus cause high vali- 
dation bias towards these simple shapes. Therefore we man- 
ually filter the datasets and select around 300 objects from 
each dataset to make sure they form a diverse and represen- 
tative collection of common objects. 

3D Shape Metrics. We extract the isosurface using March- 
ing Cubes [18] to convert implicit 3D representations (such 
as NeRF) into meshes. We sample 10K points from these 
surfaces to calculate the Chamfer Distance (CD) and F- 
score (FS). Considering that some methods are not capa- 
ble of predicting view-centric shapes, we use a brute-force 
search approach to align the predictions with the ground 
truth shapes. We linearly search the rotation angle by optimizing
for the lowest CD and further employ the Iterative Closest Point
(ICP) method to refine the alignment.


Method            CD↓     FS@0.1↑  FS@0.2↑  FS@0.5↑
One-2-3-45 [16]   0.227   0.382    0.630    0.878
ZeroShape [13]    0.160   0.489    0.757    0.952
TGS [35]          0.122   0.637    0.846    0.968
OpenLRM [10]      0.180   0.430    0.698    0.938
TripoSR (ours)    0.111   0.651    0.871    0.980


Table 2. Quantitative comparison of different techniques on the
GSO [6] validation set, where CD and FS refer to Chamfer
Distance and F-score respectively.


Method            CD↓     FS@0.1↑  FS@0.2↑  FS@0.5↑
One-2-3-45 [16]   0.197   0.445    0.698    0.907
ZeroShape [13]    0.144   0.507    0.786    0.968
TGS [35]          0.142   0.602    0.818    0.949
OpenLRM [10]      0.155   0.486    0.759    0.959
TripoSR (ours)    0.102   0.677    0.890    0.986


Table 3. Quantitative comparison of different techniques on the
OmniObject3D [30] validation set, where CD and FS refer to
Chamfer Distance and F-score respectively.

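
For readers unfamiliar with the shape metrics, here is a minimal brute-force sketch of Chamfer Distance, F-score, and a linear rotation search over point sets. Conventions for both metrics vary between papers (squared versus unsquared distances, sum versus mean of the two directions), so the particular definitions below are assumptions and may differ from the ones used for the reported numbers. The rotation axis and the number of search steps are also illustrative, and the ICP refinement is omitted.

```python
import math


def nearest_distances(src, dst):
    """Distance from each point in src to its nearest neighbour in dst."""
    return [min(math.dist(p, q) for q in dst) for p in src]


def chamfer_distance(pred, gt):
    """Symmetric Chamfer Distance: sum of the two mean nearest distances."""
    d_pg = nearest_distances(pred, gt)
    d_gp = nearest_distances(gt, pred)
    return sum(d_pg) / len(d_pg) + sum(d_gp) / len(d_gp)


def f_score(pred, gt, threshold):
    """Harmonic mean of precision and recall at a distance threshold."""
    precision = sum(d <= threshold for d in nearest_distances(pred, gt)) / len(pred)
    recall = sum(d <= threshold for d in nearest_distances(gt, pred)) / len(gt)
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def rotate_z(points, angle):
    """Rotate 3D points about the z axis (axis choice is an assumption)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in points]


def best_rotation(pred, gt, steps=36):
    """Linear search over rotation angles, keeping the lowest Chamfer Distance."""
    angles = [2.0 * math.pi * i / steps for i in range(steps)]
    return min(angles, key=lambda a: chamfer_distance(rotate_z(pred, a), gt))
```

The nested loops are quadratic in the number of points; with 10K samples per surface a spatial index such as a k-d tree would normally replace the brute-force nearest-neighbour search.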
Quantitative Comparisons. We compare TripoSR with
existing state-of-the-art feed-forward baselines for 3D
reconstruction, including One-2-3-45 [16], TriplaneGaussian
(TGS) [35], ZeroShape [13] and OpenLRM [10]¹. As shown in
Table 2 and Table 3, TripoSR outperforms all the baselines
on every reported CD and FS metric, achieving new
state-of-the-art performance on this task.

Performance vs. Runtime. Another key advantage of Tri- 
poSR is its inference speed. It takes around 0.5 seconds to 
produce a 3D mesh from a single image on an NVIDIA 
A100 GPU. Figure 2 shows a 2D plot of different tech- 
niques with inference times along the x-axis and the av- 
eraged F-Score along the y-axis. The plot shows that Tri- 
poSR is among the fastest networks, while also being the 
best-performing feed-forward 3D reconstruction model. 
Visual Results. We further show the qualitative results of
different approaches in Figure 3. Because some methods do
not reconstruct textured meshes, we render TripoSR
reconstructions both with and without vertex color for a better
comparison. As shown in the figure, ZeroShape tends to
predict over-smoothed shapes. TGS reconstructs more surface
details, but these details sometimes do not align with
the input. Moreover, neither ZeroShape nor TGS can output
textured meshes directly². On the other hand, One-2-3-45
and OpenLRM predict textured meshes, but their estimated
shapes are often inaccurate. Compared to these baselines,
TripoSR demonstrates a high reconstruction quality
for both shape and texture. Our model not only captures a
better overall 3D structure of the object, but also excels at
modeling several intricate details.


¹ We use the openlrm-large-obj-1.0 model.

² TGS leverages 3DGS to represent 3D objects. We follow the paper
and utilize their auxiliary point cloud outputs to reconstruct the surface.
However, it is non-trivial to reconstruct textures on meshes (e.g., directly
taking vertex colors from the nearest Gaussian leads to noisy textures).


Figure 3. Qualitative results. We compare TripoSR output meshes to other SOTA methods on GSO and OmniObject3D (first four columns
are from GSO [6], last two are from OmniObject3D [30]). Our reconstructed 3D shapes and textures achieve significantly higher quality
and better details than previous state-of-the-art methods.


4. Conclusion 


In this report, we present an open-source feed-forward 3D 
reconstruction model, TripoSR. The core of our model is 
a transformer-based architecture developed upon the LRM 
network [11], together with substantial technical improve- 
ments along multiple axes. Evaluated on two public bench- 
marks, our model demonstrates state-of-the-art reconstruc- 
tion performance with high computational efficiency. We 
hope TripoSR empowers researchers and developers in de- 
veloping more advanced 3D generative AI models. 